Abstract

We introduce a unified framework to incorporate risk in Markov decision processes (MDPs)
via prospect maps, which generalize the idea of coherent/convex risk measures in mathematical
finance. Most existing risk-sensitive approaches in the literature on decision-making problems
are contained in the framework as special instances. Within the framework, we solve the optimal
control problems under two criteria, a newly introduced temporal discounted criterion, which
generalizes the conventional discount scheme, and the average criterion, by value iteration
algorithms under different assumptions. Two online algorithms are proposed to solve the optimal
control problem when the exact MDP is unknown and has to be estimated during optimization.

1 Introduction 

In many applications of decision-making problems modeled by Markov decision processes (MDPs),
it is reasonable to incorporate some measure of risk to rule out policies that achieve a high expected
reward at the cost of risky and error-prone actions. Consider, for example, an expensive
manufacturing machine that has two running modes: one where the machine runs at peak level
and produces the maximum number of products most of the time, at the cost of a high chance
of serious damage, and one where the machine runs slightly slower to avoid damage. Most
companies would agree that the second option is more reasonable. Yet, if the company made
its decision with the help of a classical MDP, it would pick the first option and go for the risky
strategy.

Most decision-making models, such as MDPs, consist of two descriptions of the environment:
immediate outcomes (rewards or costs) obtained at a state by performing an action, and
transitions, i.e., the transition probabilities between states under given actions. Both
descriptions are objective in the sense that both outcomes and transition probabilities can be estimated
by experiencing the environment sufficiently many times. The "risk", however, depends on
the subjective perception of the agent, since different agents might have different risk-preferences
when facing the same environment. For instance, $100 is more valuable for the poor than for the rich.
Behavioral experiments [21] show that people tend to overreact to small-probability events, but
underreact to medium and large probabilities.

Due to the apparent usefulness of risk-sensitive objectives, the topic is of major importance in
finance and economics. In economics, the utility function is widely used to model the subjective
perception of rewards. The renowned prospect theory (PT) ^ST introduces the probability weighting
function to model the subjective perception of probabilities. PT, however, can model only a single
decision problem, whereas in an MDP a sequence of decisions has to be made. In mathematical
finance, Ruszczyński (2010) [28] applies coherent/convex risk measures (CRMs) pi lll| to incorporate
risk in a sequential decision-making structure. However, there are two major drawbacks in
his work: 1) he assumes that the risk measures must be coherent or convex, which is not true for
some of the most important instances of risk measures, and 2) he discusses only the finite-stage
or discounted risk problem for coherent risk measures. The theory of discounted and average risk
for arbitrary measures, as in the classical MDP, has not been considered yet.

In the MDP community (mainly operations research and control theory), despite the apparent
usefulness of risk-sensitive measures, few works address the issue, since many
risk-sensitive objectives cannot be optimized efficiently. The mean-variance trade-off is a popular
risk criterion, where the variance plays the part of the risk measure as it penalizes highly varying
returns. However, this objective is difficult to optimize, especially when a discount factor is included
[TU] . Recently, in [23], the problem was proved to be NP-hard even for finite-horizon MDPs. Another
popular approach is to apply the exponential utility function. Although an efficient solution (see
e.g. [S]) exists for average infinite-horizon MDPs, it is proved in [7] that the objective for discounted
MDPs is difficult and the optimal policy might not be stationary.

The question is now whether all risk-sensitive objectives are difficult to optimize for MDPs, or whether
measures like the mean-variance trade-off are simply not the "right" measures for MDPs. Inspired by
the developments in mathematical finance and economics, our intuition is to adapt the CRM
theory to the MDP structure, where two concerns must be balanced: 1) the axioms should be as
general as possible, so as to model all kinds of risk-preferences including mixed risk-preference,
and 2) the underlying optimization problem should be solvable by a computationally feasible algorithm.

The main contributions of this paper are: 1) To incorporate risk into MDPs, we set up a general
framework via prospect maps, which are a generalization of the CRMs. The framework contains
most of the existing risk-sensitive approaches in economics, mathematical finance and optimal
control theory as special cases (cf. Sec. 5). 2) Within the framework, we define a novel temporal
discount scheme, which includes the conventional temporal discount scheme as a special case. The
optimization problem for the new discounted objective function is proved to be solvable by a value
iteration algorithm. 3) We investigate the optimization problem of the average prospect. Under
one additional assumption, a solution to this optimization problem exists and a value iteration
algorithm is designed to find it. 4) For the case where the MDP, i.e., its reward and transition model, is
unknown, we state one algorithm that estimates the reward and transition models of the underlying
MDP and simultaneously learns the optimal policy. For one specific prospect map (the entropic map), a
Q-learning-like algorithm is proposed to obtain the optimal policy without knowledge of the
MDP.

In order to avoid tedious mathematical details in general state-action spaces, we currently
consider only MDPs with finite state-action spaces. However, the extensions to general spaces
are straightforward.

This paper is organized as follows. In Sec. 2, we briefly introduce the setups of MDPs and
prospect maps, which are adapted in Sec. 3 to the MDP structure. Sec. 4 states the major theory
of this paper, the discounted prospect and the average prospect, whose optimal control problems are
solved by value iterations under different assumptions. In Sec. 5 we discuss the existing
risk-sensitive approaches and show how to represent them with specific prospect maps. Two online
algorithms, which might be of interest to an engineering-oriented audience, are stated in Sec. 6, which
is followed by experiments with simple MDPs in the final section.

2 Setup 

2.1 Markov Decision Processes 

A Markov decision process [26] is composed of a state space $X$, an action space $A$, a transition
model $Q$ and a reward model $r$. Both state and action space are assumed to be finite. The
transition model $Q(y|x,a) := \mathbb{P}(X_{t+1} = y \mid X_t = x, A_t = a)$ denotes the probability of arriving
at state $y$ given the current state $x$ and the chosen action $a$ at time $t$. We assume the transition is
time-homogeneous. The reward function $r(x,a): X \times A \to \mathbb{R}$ represents the reward obtained at
state $x$ if action $a$ is chosen.

The policy $\pi_t(a|x)$ at time $t$ is defined as the probability of choosing action $a$ given state $x$. Let
$\pi := [\pi_0, \pi_1, \ldots]$ be the sequential policy, where at time $t = 0$ the policy $\pi_0$ is used, at $t = 1$ the
policy $\pi_1$, etc. Let $\Pi$ be the set of all policies. A policy is called Markov if for all $t \in \mathbb{N}$, $\pi_t$ depends
only on $x_t$ and is independent of all states and actions before time $t$. Let $\Pi_M$ denote the
set of all Markov policies and $\Delta$ the set of all one-step Markov policies. Thus $\Pi_M = \Delta^\infty$. A
one-step policy $f \in \Delta$ is called Markov deterministic if, for each $x \in X$, $f(a|x) = 1$ for some $a \in A$.
With slight abuse of notation, we also write $f$ as a deterministic function $f(x) = a$. Denote
the set of all one-step Markov deterministic policies by $\Delta_D \subset \Delta$. For any $\pi \in \Delta$, we define

$r^\pi(x) := \sum_{a \in A} \pi(a|x)\, r(x,a), \quad P^\pi(y|x) := \sum_{a \in A} \pi(a|x)\, Q(y|x,a)$ (1)
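As a minimal sketch of Eq. (1), the policy-averaged reward and transition model can be computed with array operations. The array layout (`pi[x, a]`, `r[x, a]`, `Q[x, a, y]`), the function name and the two-state MDP below are assumptions made only for illustration; they are not taken from the paper.

```python
import numpy as np

def policy_reward_and_transition(pi, r, Q):
    """Average reward and transition model over a one-step policy (Eq. 1).

    pi[x, a]   : probability of action a in state x (each row sums to 1)
    r[x, a]    : reward for taking action a in state x
    Q[x, a, y] : probability of moving from x to y under action a
    """
    r_pi = np.sum(pi * r, axis=1)            # r^pi(x)   = sum_a pi(a|x) r(x,a)
    P_pi = np.einsum("xa,xay->xy", pi, Q)    # P^pi(y|x) = sum_a pi(a|x) Q(y|x,a)
    return r_pi, P_pi

# Hypothetical 2-state, 2-action MDP, used only for illustration.
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.5, 0.5],     # randomized in state 0
               [1.0, 0.0]])    # deterministic in state 1

r_pi, P_pi = policy_reward_and_transition(pi, r, Q)
```

Each row of `P_pi` is again a probability distribution, so a one-step policy turns the controlled model into an ordinary Markov chain with reward vector `r_pi`.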

Three types of objective functions are commonly used in the MDP literature: finite-stage,
discounted and average reward. We summarize them as follows,

$S_T := \sum_{t=0}^{T} r(X_t,A_t), \quad S_\alpha := \sum_{t=0}^{\infty} \alpha^t r(X_t,A_t), \quad \text{and} \quad S := \lim_{T\to\infty} \frac{1}{T} S_T$ (2)

where $\alpha \in [0,1)$ denotes the discount factor. Suppose we start from a given state $X_0 = x$. The
optimization problem is to maximize the expected objective by selecting a policy $\pi$:

$\max_{\pi \in \Pi} \mathbb{E}^\pi [S \mid X_0 = x]$ (3)

where the generic objective $S$ stands for $S_T$, $S_\alpha$ or the average reward $S$.^1

2.2 Dynamic Prospect Maps 

In the setup of MDPs, we use "rewards" instead of "costs" (which are common in the literature
of Markov control processes PJ3J) to model immediate outcomes, and therefore in the optimization
problems of MDPs (Eq. (3)), objectives are to be maximized rather than minimized. To be consistent
with maximizing objectives, we use the name "prospect maps" for the nonlinear structures analogous to
risk measures in the finance literature. A similar nomenclature can also be found in [20], where risk is
replaced by valuation.

^1 Note that since the limit defining the average reward $S$ might not exist (see e.g. Example 8.1.1, [26]), the
strict definition of the optimization problem of average reward should be
$\max_{\pi \in \Pi} \liminf_{T\to\infty} \frac{1}{T}\, \mathbb{E}^\pi [S_T \mid X_0 = x]$.

Let us consider a discrete-time stochastic process $\{X_t \in X\}_{t=0}^{\infty}$ and $Y_t = \{X_0, X_1, \ldots, X_t\} \in X^{t+1}$.
The capital letters $X_t$ and $Y_t$ denote random variables, whereas the realizations of the
random variables are denoted by lowercase letters, $x_t$ and $y_t$, respectively. Let $\mathcal{F}_t$ denote the set of
all real-valued bounded functions on $X^{t+1}$, for $t = 1, 2, \ldots$ We consider a map $R_t(v|y_t)$ such that
$R_t(v|\cdot)$ is a real-valued bounded function on $X^{t+1}$ for fixed $v \in \mathcal{F}_{t+1}$. $R_t$ can also be viewed as a
map from $\mathcal{F}_{t+1}$ to $\mathcal{F}_t$. In the following, the (in-)equalities between two functions are understood
elementwise, i.e., we say $v \le w$ if $v(x) \le w(x)$ for all $x$.

In the following, we first introduce conditional prospect maps and then construct a dynamic
prospect map from $t$ to $T$, $R_{t,T}: \mathcal{F}_T \to \mathcal{F}_t$, by a series of conditional prospect maps $\{R_s\}_{s=t}^{T-1}$.

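Definition 2.1. A map $R_t: \mathcal{F}_{t+1} \to \mathcal{F}_t$, $t \in \mathbb{N} \cup \{\infty\}$, is called a conditional prospect map, if

I Monotonicity. $\forall v \in \mathcal{F}_{t+1}$, $\forall w \in \mathcal{F}_{t+1}$, if $v \le w$, then $R_t(v) \le R_t(w)$.

II Time-consistency. For any $v \in \mathcal{F}_{t+1}$ and $w \in \mathcal{F}_t$, $R_t(v + w) = w + R_t(v)$. In particular, for
each $w \in \mathbb{R}$ and $v \in \mathcal{F}_{t+1}$, $R_t(v + w) = w + R_t(v)$.

III Centralization. $R_t(0) = 0$.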

Remarks. The monotonicity axiom reflects the intuition that if the rewards of one choice are higher
than the rewards of another choice, the prospect of the first choice must be higher than that of the
other. The time-consistency axiom is a generalization of the corresponding property of the conditional
expectation. This axiom allows the temporal decomposition (see Proposition 2.1) and, together with the
axiom of monotonicity, makes dynamic programming [3] a feasible solution to the optimization problems
(see Sec. 4). The axiom of centralization sets the reference point to be 0, i.e., there is no risk if
there is no cost. Nevertheless, it is possible to use other reference points.
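The three axioms can be checked numerically for a concrete one-step map. The sketch below uses the entropic map of Eq. (15) for a single fixed state-action pair; the next-state distribution `q`, the risk parameter `lam` and the test vectors are hypothetical values chosen only for illustration, and the time-consistency axiom is tested only for constant shifts.

```python
import numpy as np

def entropic_map(v, q, lam):
    """One-step entropic map (1/lam) * log E_q[exp(lam * v)] for a fixed (x, a)."""
    return np.log(np.dot(q, np.exp(lam * v))) / lam

q = np.array([0.2, 0.5, 0.3])        # hypothetical next-state distribution
lam = -1.0                           # hypothetical risk parameter
v = np.array([1.0, -2.0, 0.5])
w = v + np.array([0.0, 1.0, 0.3])    # w >= v elementwise

# Axiom I (monotonicity): v <= w implies R(v) <= R(w).
monotone = entropic_map(v, q, lam) <= entropic_map(w, q, lam)

# Axiom II restricted to constants: R(v + c) = c + R(v).
shift = 3.7
translation_gap = abs(entropic_map(v + shift, q, lam)
                      - (shift + entropic_map(v, q, lam)))

# Axiom III (centralization): R(0) = 0.
centred_gap = abs(entropic_map(np.zeros(3), q, lam))
```

A finite set of test vectors cannot prove the axioms; the check is only a sanity test for an implementation of a candidate map.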

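Definition 2.2. A map $R_{t,T}: \mathcal{F}_T \to \mathcal{F}_t$, $0 \le t < T \in \mathbb{N} \cup \{\infty\}$, is called a dynamic prospect
map, if there exists a series of conditional prospect maps $\{R_s\}_{s=t}^{T-1}$ such that

$R_{t,T}(v) := R_t(R_{t+1}(\ldots R_{T-1}(v) \ldots)), \quad v \in \mathcal{F}_T.$

Proposition 2.1. Let $v_s \in \mathcal{F}_s$, $t \le s \le T$, $t, T \in \mathbb{N} \cup \{\infty\}$, $t < T$, and $v = \sum_{s=t}^{T} v_s \in \mathcal{F}_T$. Then we
have

$R_{t,T}(v) = v_t + R_t(v_{t+1} + \ldots + R_{T-1}(v_T) \ldots)$

Proof. Trivial using Axiom II. □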

Remarks. In the literature of finance, there exist various ways to extend the CRM to a
temporal structure (e.g., [H UHl HI [2E] and references therein). The definition is usually selected
based on the application to which the dynamic risk measures are applied. Comparing their subtle
differences is beyond the scope of this paper. Nevertheless, two points are worth noting:
1) in all these definitions, the axiom of time-consistency is the most important component, since it
allows the temporal decomposition shown in Prop. 2.1, and 2) these definitions require either
coherence [25] or convexity [HI [201 [I] , which means that the agent has to be economically rational,
i.e., risk-aversive (for more discussion see Sec. 3.3). However, in some problems (especially in modeling
real human behavior), mixed risk-preference (risk-aversive at some states while risk-seeking at
other states) is also a possible strategy. For instance, when gambling, some people are risk-aversive
when losing money but risk-seeking when winning money. Therefore, we require neither coherence
nor convexity. In this sense, our axioms are more general than the axioms used in the finance
literature. Finally, in the literature of coherent risk measures, non-additive measures can be defined
due to the coherence. In this paper, however, we do not assume coherence in the axioms. Instead,
we build the theory on the function spaces $\{\mathcal{F}_t\}$. Therefore, it is more accurate to use the
term "map" than "measure".
3 Applying Prospect Maps in MDPs

The dynamic prospect maps introduced in Sec. 2.2 can be adapted to arbitrary temporal structures.
To adapt them to the structure of MDPs, we assume the prospect maps work on the state sequence
$\{X_t\}_{t=0}^{\infty}$. On the other hand, since the probability of $\{X_t\}_{t=0}^{\infty}$ is controlled by policies $\pi \in \Pi$
(together with the transition model $Q$ of the MDP, defined in Sec. 2.1), we assume further that
the prospect maps depend on $\pi$. Thus, the conditional prospect maps working on the MDP
$(X, A, r, Q)$ given one policy $\pi$ are written as $\{R_t^\pi\}$.

3.1 Markov Prospect Maps for MDPs 

The conditional prospect maps defined in Def. 2.1 might depend on the whole history,
which could cause computational problems in real applications. Therefore, the prospect maps
are additionally assumed to possess the Markov property. Let $\mathcal{F}_b$ denote the space of all bounded
functions that map from $X$ to $\mathbb{R}$.

Definition 3.1 (Markov Prospect Map for MDPs). Let $\{R_t^\pi\}$ be a series of conditional
prospect maps defined on the MDP $(X, A, r, Q)$ given the policy $\pi$. $\{R_t^\pi\}$ is called Markov, if there
exists a series of maps $\{g_t: \mathcal{F}_b \to \mathcal{F}_b\}$ such that

$R_t^\pi(v(X_{t+1}) \mid x_t, x_{t-1}, \ldots, x_0) = g_t(v|x_t), \quad \forall t \in \mathbb{N}, v \in \mathcal{F}_b$

Remark. Note that the prospect map $\{R_t^\pi\}$ also depends on $Q$.

From now on, we consider only Markov prospect maps. Thus we can write $R_t^\pi$ as
$R_t^\pi(v(X_{t+1})|x_t)$. Furthermore, we consider only Markov policies $\pi \in \Pi_M$. For a Markov
random policy $\pi = [\pi_0, \pi_1, \ldots] \in \Pi_M$, $R_t^\pi(v(X_{t+1})|x_t)$ depends only on $\pi_t \in \Delta$. Hence, we
can write $R_t^\pi(v(X_{t+1})|x_t)$ as $R_t^{\pi_t}(v(X_{t+1})|x_t)$. For each $(x_t, a_t)$-pair, there exists a corresponding
deterministic policy $f \in \Delta_D \subset \Delta$ satisfying $f(a_t|x_t) = 1$. Therefore, we can define for each $(x_t, a_t)$,

$R_t(v(X_{t+1}) \mid x_t, a_t) := R_t^f(v(X_{t+1}) \mid x_t)$ (4)

Assumption 3.1. We assume that the Markov prospect map $R_t^{\pi_t}$ is linear in $\pi_t$, i.e.,

$R_t^{\pi_t}(v(X_{t+1}) \mid x_t) = \sum_{a \in A} \pi_t(a|x_t)\, R_t(v(X_{t+1}) \mid x_t, a), \quad \forall t \in \mathbb{N}.$

To simplify the problem, we consider only time-homogeneous Markov prospect maps, i.e.,
$R_t = R$ for all $t$. Hence, $R^{\pi_t}(v(X_{t+1})|x_t)$ can be abbreviated by $R^{\pi_t}(v(X')|x)$, $x \in X$, $v \in \mathcal{F}_b$,
and furthermore by $R^{\pi}(v|x)$. Similar abbreviations are used for $R(v|x,a)$, which is a special case
of $R^\pi(v|x)$. By Assumption 3.1, analogous to $P^\pi$ in Eq. (1), we obtain

$R^\pi(v|x) = \sum_{a \in A} \pi(a|x)\, R(v|x,a)$

Then $R^\pi(v)$, which is defined by $R^\pi(v)(x) := R^\pi(v|x)$, is a function in the space $\mathcal{F}_b$. $R^\pi$ can be
viewed as a map from $\mathcal{F}_b$ to $\mathcal{F}_b$ itself. Since we assume the state space is finite, $v$ can be viewed
as an $N$-dimensional vector, where $N$ denotes the number of states. Thus $R^\pi$ can be understood as
a map from $\mathbb{R}^N$ to itself.
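Under Assumption 3.1, the policy-dependent map is fully determined by the state-action map $R(v|x,a)$. The following sketch illustrates this mixing step; the conditional expectation is used for $R(v|x,a)$ purely as an example, and the function names, array layout and numbers are assumptions made for illustration.

```python
import numpy as np

def expectation_map(v, Q):
    """Example choice R(v|x,a) = sum_y Q(y|x,a) v(y); returns an array of shape (N, |A|)."""
    return Q @ v

def policy_map(v, pi, R_xa):
    """R^pi(v|x) = sum_a pi(a|x) R(v|x,a), the linearity required by Assumption 3.1."""
    return np.sum(pi * R_xa(v), axis=1)

# Hypothetical 2-state, 2-action model with layout Q[x, a, y].
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.5, 0.5],
               [1.0, 0.0]])
v = np.array([1.0, 3.0])

R_pi_v = policy_map(v, pi, lambda u: expectation_map(u, Q))   # a vector in R^N
```

Any other state-action map with the same call signature (a function from an $N$-vector to an $N \times |A|$ array) could be passed as `R_xa` in place of the expectation.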

Remark. For a time-homogeneous Markov map $R$, Assumption 3.1 enables $R(v|x,a)$ to play
a role similar to that of the transition model $Q$ in MDPs. Another consequence of Assumption 3.1 is that for
all $v \in \mathbb{R}^N$ and $x \in X$, there exists a deterministic policy $f \in \Delta_D$ such that, for any $\alpha \in [0,1]$,

$r^f(x) + \alpha R^f(v|x) = r(x, f(x)) + \alpha R(v|x, f(x)) = \max_{\pi \in \Delta} \{ r^\pi(x) + \alpha R^\pi(v|x) \}.$
3.2 Nonexpansiveness

Any time-homogeneous Markov prospect map $R$ satisfies, by definition, the axioms of monotonicity
and time-consistency for each $\pi \in \Delta$. Thus $R^\pi$ is a topical map (see [12]), which satisfies i)
$R^\pi(v) \le R^\pi(w)$ whenever $v \le w \in \mathbb{R}^N$, and ii) $R^\pi(v + w) = w + R^\pi(v)$ for all $v \in \mathbb{R}^N$ and $w \in \mathbb{R}$.
For each $v \in \mathcal{F}_b$, we define the Hilbert semi-norm^2 and the sup-norm as follows,

$\|v\|_H := \sup_{x,y \in X} (v(x) - v(y)), \quad \|v\|_\infty := \sup_{x \in X} |v(x)|.$

Since we consider only finite state spaces, $v$ is simply an $N$-dimensional vector.

Suppose $F: \mathbb{R}^N \to \mathbb{R}^N$ is a topical map. Then it can be shown that $F$ is nonexpansive under
both the Hilbert semi-norm and the sup-norm (see Eq. 17 and 18, [12]), i.e., for all $v, w \in \mathbb{R}^N$,

$\|F(v) - F(w)\|_H \le \|v - w\|_H, \quad \|F(v) - F(w)\|_\infty \le \|v - w\|_\infty$
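The two norms and the nonexpansiveness property can be illustrated numerically. In the sketch below, the topical map is the operator $v \mapsto \max_a [r(x,a) + \sum_y Q(y|x,a) v(y)]$, which is monotone and commutes with constant shifts; the random model, the helper names and the sample size are assumptions chosen only for illustration.

```python
import numpy as np

def hilbert_seminorm(v):
    """sup_{x,y} (v(x) - v(y)), i.e. max(v) - min(v) on a finite state space."""
    return np.max(v) - np.min(v)

def sup_norm(v):
    return np.max(np.abs(v))

def topical_example(v, Q, r):
    """F(v|x) = max_a [ r(x,a) + sum_y Q(y|x,a) v(y) ]: monotone and shift-equivariant."""
    return np.max(r + Q @ v, axis=1)

rng = np.random.default_rng(0)
N, A = 4, 3
Q = rng.random((N, A, N))
Q /= Q.sum(axis=2, keepdims=True)      # normalize each Q[x, a, :] to a distribution
r = rng.standard_normal((N, A))

nonexpansive = True
for _ in range(1000):
    v, w = rng.standard_normal(N), rng.standard_normal(N)
    dF = topical_example(v, Q, r) - topical_example(w, Q, r)
    nonexpansive = nonexpansive and bool(hilbert_seminorm(dF) <= hilbert_seminorm(v - w) + 1e-12)
    nonexpansive = nonexpansive and bool(sup_norm(dF) <= sup_norm(v - w) + 1e-12)
```

The small slack `1e-12` only absorbs floating-point rounding; it is not part of the inequality.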



3.3 Categorization 

Suppose $R^\pi(v|x)$ is a time-homogeneous Markov prospect map for some one-step Markov policy
$\pi \in \Delta$. Assume furthermore that $R^\pi(v|x)$ is concave with respect to $v$ at $x$, i.e., for any $v, w \in \mathbb{R}^N$
and any $\beta \in [0,1]$, we have

$R^\pi(\beta v + (1-\beta) w \mid x) \ge \beta R^\pi(v|x) + (1-\beta) R^\pi(w|x)$

Note that the objective is to maximize the prospect (which will be defined in Sec. 4). Suppose
we have two policies $\pi_1$ and $\pi_2$ in the successive time step, which generate two outcomes $v$ and
$w$ respectively. The concavity of $R^\pi(v|x)$ implies that the prospect of the mixture of the two policies,
$R^\pi(\beta v + (1-\beta) w|x)$, is always preferred (due to maximization) to the mixed prospect of the two single
policies, $\beta R^\pi(v|x) + (1-\beta) R^\pi(w|x)$. In other words, given the policy $\pi$ chosen at the current
time step, we prefer a mixture of two policies at the successive time step. This shows that the
corresponding risk-preference of the prospect map is risk-aversive. A similar result can be inferred
for convex prospect maps. This categorization coincides with the categorization of risk-preferences
by the concavity of the utility function in expected utility theory [13]. In order to obtain a
time-homogeneous risk-preference (risk-aversive or risk-seeking), the risk-preference must hold
everywhere. We define these notions as follows.

Definition 3.2. A time-homogeneous Markov prospect map $R^\pi(v|x)$ is said to be

(i) risk-aversive at $x \in X$, if it is concave w.r.t. $v$ at $x$, and everywhere risk-aversive, if $R^\pi(v|x)$
is concave w.r.t. $v$ at all $x \in X$ and for all $\pi \in \Delta$.

(ii) risk-seeking at $x \in X$, if $R^\pi(v|x)$ is convex w.r.t. $v$ at $x$, and everywhere risk-seeking, if
$R^\pi(v|x)$ is convex w.r.t. $v$ at all $x \in X$ and for all $\pi \in \Delta$.

Remarks. The categorization depends on the objective. In the CRM theory, the objective is
to minimize the risk. Therefore, the categorization is the opposite: concavity means risk-seeking and
convexity means risk-aversive. Under Assumption 3.1, if $R(v|x,a)$ is convex (concave)
w.r.t. $v$ at all $(x,a)$-pairs, then $R^\pi(v|x)$ is everywhere convex (everywhere concave). Several existing
risk maps in the literature (see Sec. 5) also confirm the categorization defined above.
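The concavity test behind this categorization can be sketched numerically. The example below evaluates the gap $R(\beta v + (1-\beta) w) - [\beta R(v) + (1-\beta) R(w)]$ for the entropic map of Eq. (15) at one fixed state-action pair: a nonnegative gap is consistent with concavity (risk-aversive), a nonpositive gap with convexity (risk-seeking). The distribution `q`, the values of the risk parameter and the random test points are assumptions for illustration only.

```python
import numpy as np

def entropic_map(v, q, lam):
    """(1/lam) * log E_q[exp(lam * v)] for one fixed state-action pair."""
    return np.log(np.dot(q, np.exp(lam * v))) / lam

def jensen_gap(lam, q, v, w, beta):
    """R(beta v + (1-beta) w) - [beta R(v) + (1-beta) R(w)]; >= 0 indicates concavity."""
    mixed = entropic_map(beta * v + (1 - beta) * w, q, lam)
    return mixed - (beta * entropic_map(v, q, lam) + (1 - beta) * entropic_map(w, q, lam))

rng = np.random.default_rng(1)
q = np.array([0.3, 0.3, 0.4])          # hypothetical next-state distribution
gaps_negative_lam, gaps_positive_lam = [], []
for _ in range(500):
    v, w, beta = rng.standard_normal(3), rng.standard_normal(3), rng.random()
    gaps_negative_lam.append(jensen_gap(-0.5, q, v, w, beta))   # expected >= 0
    gaps_positive_lam.append(jensen_gap(+0.5, q, v, w, beta))   # expected <= 0
```

Sampling can only falsify concavity or convexity, never prove it; the sign pattern observed here merely agrees with the categorization stated for entropic maps in Sec. 5.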

One widely used family of prospect maps, the coherent prospect maps, is worth mentioning.

Definition 3.3. A time-homogeneous Markov prospect map $R^\pi(v|x)$ is said to be coherent if for
all $\lambda > 0$, $R^\pi(\lambda v|x) = \lambda R^\pi(v|x)$ for all $\pi \in \Delta$, $v \in \mathbb{R}^N$ and $x \in X$.

^2 Here we follow the terminology in [12, 24], whereas the same semi-norm is called span semi-norm in [26, 15].
4 Discounted and Average Prospect

4.1 Finite-stage Prospect

According to the definition of dynamic prospect maps (Def. 2.2), we define the T-stage total
prospect as follows,

$J_T(x,\pi) := R^{\pi}_{0,T}\left(\sum_{t=0}^{T} r(X_t,A_t) \,\middle|\, X_0 = x\right)$ (5)

Suppose the prospect map $\{R_t\}$ under consideration is time-homogeneous and Markov. By
Prop. 2.1, we have the following decomposition

$J_T(x,\pi) = r^{\pi_0}(x) + R^{\pi_0}_{X_0=x}\left[r^{\pi_1}(X_1) + R^{\pi_1}_{X_1}\left[r^{\pi_2}(X_2) + \ldots + R^{\pi_{T-1}}_{X_{T-1}}\left[r^{\pi_T}(X_T)\right] \ldots\right]\right]$

where the short notation $R^{\pi_t}_{X_t}(v(X_{t+1})) := R^{\pi_t}(v(X_{t+1}) \mid X_t)$ is used. The optimization problem
for this objective function is to maximize the T-stage total prospect among all Markov random
policies, i.e.,

$J_T^*(x) = \max_{\pi \in \Pi_M} J_T(x,\pi)$

Suppose Assumption 3.1 holds true. The optimization problem can then be solved by dynamic
programming, i.e., we start from

$V_T(x) = \max_{\pi \in \Delta} r^\pi(x) = \max_{a \in A} r(x,a)$

Then we calculate backwards, for $t = T-1, T-2, \ldots, 0$,

$V_t(x) = \max_{\pi \in \Delta} \{ r^\pi(x) + R^\pi(V_{t+1}|x) \} = \max_{a \in A} \{ r(x,a) + R(V_{t+1}|x,a) \}$

It is easy to verify that $V_0(x) = J_T^*(x)$.
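The backward recursion can be sketched as follows. The state-action map is passed in as a function `R_xa` that maps an $N$-vector to an $N \times |A|$ array; this interface, the function name and the example model (with the conditional expectation as $R$) are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def finite_stage_prospect(r, R_xa, T):
    """Backward recursion V_T = max_a r(x,a), V_t = max_a [ r(x,a) + R(V_{t+1}|x,a) ].

    Returns V_0 and the greedy deterministic decision rules for t = 0, ..., T-1
    (the rule at the final stage T is simply argmax_a r(x,a)).
    """
    V = np.max(r, axis=1)                        # V_T
    rules = []
    for _ in range(T):                           # t = T-1, ..., 0
        q_values = r + R_xa(V)                   # shape (N, |A|)
        rules.append(np.argmax(q_values, axis=1))
        V = np.max(q_values, axis=1)
    rules.reverse()                              # rules[t] is the decision rule at time t
    return V, rules

# Hypothetical 2-state, 2-action model with the expectation as prospect map.
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])

V0, rules = finite_stage_prospect(r, lambda v: Q @ v, T=1)
```

For `T=1` the recursion performs a single backup of $V_1 = \max_a r(x,a)$; replacing the lambda by another state-action map changes the risk attitude without changing the recursion.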



4.2 Discounted Total Prospect 

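Let $\alpha \in [0,1)$ denote the discount factor. Suppose Assumption 3.1 holds true. We define the
discounted T-stage prospect as follows,

$J_{\alpha,T}(x,\pi) := r^{\pi_0}(x) + \alpha R^{\pi_0}_{X_0=x}\left[r^{\pi_1}(X_1) + \alpha R^{\pi_1}_{X_1}\left[r^{\pi_2}(X_2) + \ldots + \alpha R^{\pi_{T-1}}_{X_{T-1}}\left[r^{\pi_T}(X_T)\right] \ldots\right]\right]$ (6)

and the discounted total prospect as

$J_\alpha(x,\pi) := \lim_{T\to\infty} J_{\alpha,T}(x,\pi)$ (7)

Thus, the optimization problem for the discounted total prospect is

$J_\alpha^*(x) := \sup_{\pi \in \Pi_M} J_\alpha(x,\pi)$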

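We first prove that the limit in Eq. (7) exists. Given $\pi \in \Delta$, we define the map $F^\pi_\alpha: \mathbb{R}^N \to \mathbb{R}^N$
as $F^\pi_\alpha(v|x) := r^\pi(x) + \alpha R^\pi(v|x)$, $v \in \mathbb{R}^N$ and $x \in X$. For any $\pi \in \Pi_M$, define

$F^\pi_{\alpha,T}(v) := F^{\pi_0}_\alpha(F^{\pi_1}_\alpha(\ldots F^{\pi_{T-1}}_\alpha(v) \ldots))$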
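Proposition 4.1. For any $\pi \in \Pi_M$, i) the limit in Eq. (7) exists; ii) $\lim_{T\to\infty} F^\pi_{\alpha,T}(v) = J_\alpha(\pi)$, $\forall v \in \mathbb{R}^N$.

Proof. (i) Since $r$ is bounded for finite state and action spaces, there exists a number $M < \infty$ such
that $|r(x,a)| \le M$ for all $(x,a)$. Hence, by the monotonicity and the additive property of $R$,

$-\alpha^{T+1} M \le J_{\alpha,T+1}(x,\pi) - J_{\alpha,T}(x,\pi) \le \alpha^{T+1} M$

which implies $|J_{\alpha,T+1}(x,\pi) - J_{\alpha,T}(x,\pi)| \to 0$ as $T \to \infty$. Since the bounds $\alpha^{T+1} M$ are summable, $\{J_{\alpha,T}(x,\pi)\}_T$ is a Cauchy sequence and the limit exists.

(ii) Since $v \in \mathbb{R}^N$, $v - r^\pi$ is also bounded for all $\pi \in \Delta$. Let $M'$ be an upper bound such that
$\|v - r^\pi\|_\infty \le M'$. Hence,

$r^\pi - M' \le v \le r^\pi + M' \;\Rightarrow\; -M'\alpha^T \le F^\pi_{\alpha,T}(v) - J_{\alpha,T}(x,\pi) \le M'\alpha^T$

Using the conclusion of (i), we have $\lim_{T\to\infty} F^\pi_{\alpha,T}(v) = J_\alpha(\pi)$, $\forall v \in \mathbb{R}^N$. □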
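Discussion. The trivial extension of the classical discounted MDP (cf. $S_\alpha$ in Eq. (2)) is as follows,

$D_\alpha(x,\pi) := R^{\pi}_{0,\infty}\left(\sum_{t=0}^{\infty} \alpha^t r(X_t,A_t) \,\middle|\, X_0 = x\right)$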

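Using the time-consistency property of prospect maps, we have the following decomposition

$D_\alpha(x,\pi) = r^{\pi_0}(x) + R^{\pi_0}_{X_0=x}\left[\alpha r^{\pi_1}(X_1) + R^{\pi_1}_{X_1}\left[\alpha^2 r^{\pi_2}(X_2) + \ldots + R^{\pi_{t-1}}_{X_{t-1}}\left[\alpha^t r^{\pi_t}(X_t) + \ldots\right] \ldots\right]\right]$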

We have the following observations:

• We can prove, analogously to Prop. 4.1 i), that $D_\alpha$ is well-defined.

• If the prospect map $R$ is coherent, then $D_\alpha$ is equivalent to $J_\alpha$ (cf. Eq. (7)), the discounted
total prospect under our definition. Therefore, $D_\alpha$ defined for any coherent prospect map
is merely a special case of our definition. In particular, the discounted total reward in classical
MDPs is a special case of the discounted total prospect, since the expectation is coherent.

• For general prospect maps, there might not exist a stationary policy that optimizes $D_\alpha$, as proved by
Chung & Sobel (1987) [7] for entropic prospect maps, which are not coherent. Statements
analogous to Theorem 4 in [7] can be proved for arbitrary non-coherent prospect maps.

• Ruszczyński (2010) [28] uses $D_\alpha$ as the objective function, which is solved by a value
iteration algorithm. However, the proof of the value iteration algorithm relies on the
representation theorem, which is valid only for coherent prospect maps. In contrast,
we will see later that the objective $J_\alpha$ allows a value iteration algorithm for arbitrary prospect
maps.
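The equivalence claimed for coherent maps can be made explicit on a truncated example. The following worked derivation is a sketch for two stages only, assuming $\alpha > 0$ and that $R^{\pi_0}$ and $R^{\pi_1}$ are positively homogeneous in the sense of Def. 3.3; the truncation to two stages is an illustrative simplification.

```latex
% Two-stage illustration; R^{\pi_0} and R^{\pi_1} are assumed coherent (Def. 3.3).
\begin{align*}
 & r^{\pi_0}(x) + R^{\pi_0}_{X_0=x}\!\left[\alpha r^{\pi_1}(X_1)
      + R^{\pi_1}_{X_1}\!\left[\alpha^2 r^{\pi_2}(X_2)\right]\right] \\
 &\quad= r^{\pi_0}(x) + R^{\pi_0}_{X_0=x}\!\left[\alpha\left(r^{\pi_1}(X_1)
      + \alpha R^{\pi_1}_{X_1}\!\left[r^{\pi_2}(X_2)\right]\right)\right]
      && \text{(homogeneity of } R^{\pi_1}\text{)} \\
 &\quad= r^{\pi_0}(x) + \alpha R^{\pi_0}_{X_0=x}\!\left[r^{\pi_1}(X_1)
      + \alpha R^{\pi_1}_{X_1}\!\left[r^{\pi_2}(X_2)\right]\right]
      && \text{(homogeneity of } R^{\pi_0}\text{)} \\
 &\quad= J_{\alpha,2}(x,\pi).
\end{align*}
```

Without positive homogeneity the factors $\alpha^t$ cannot be pulled out of the maps, which is why the two objectives differ for non-coherent maps such as the entropic map.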

Contracting Map. Given a function $u \in \mathbb{R}^N$ and $x \in X$, consider the following map

$F_\alpha(u|x) := \max_{\pi \in \Delta} F^\pi_\alpha(u|x) = \max_{a \in A} [r(x,a) + \alpha R(u|x,a)]$ (under Assumption 3.1)

Now we prove the key property: $F_\alpha$ is a contracting map.

Lemma 4.1. Suppose Assumption 3.1 holds true. Then $F_\alpha$ is a contracting map under the sup-norm,
i.e., $\|F_\alpha(u) - F_\alpha(v)\|_\infty \le \alpha \|u - v\|_\infty$, for all $u$ and $v \in \mathbb{R}^N$.
Proof. Under Assumption 3.1 there exist deterministic policies $f$ and $g$ satisfying

$F_\alpha(u|x) = r(x,f(x)) + \alpha R(u|x,f(x)), \quad F_\alpha(v|x) = r(x,g(x)) + \alpha R(v|x,g(x))$

Since $F_\alpha(v|x) \ge r(x,f(x)) + \alpha R(v|x,f(x))$ by definition, we have for all $x \in X$,

$F_\alpha(u|x) - F_\alpha(v|x) \le r(x,f(x)) + \alpha R(u|x,f(x)) - r(x,f(x)) - \alpha R(v|x,f(x))$
$= \alpha [R(u|x,f(x)) - R(v|x,f(x))] \le \alpha \|u - v\|_\infty$

where the last inequality is due to the nonexpansiveness of $R$. Exchanging $u$ and $v$, we have

$F_\alpha(v|x) - F_\alpha(u|x) \le \alpha \|v - u\|_\infty$

Thus, $|F_\alpha(u|x) - F_\alpha(v|x)| \le \alpha \|u - v\|_\infty$ for all $x \in X$, which implies $\|F_\alpha(u) - F_\alpha(v)\|_\infty \le
\alpha \|u - v\|_\infty$. □

Value iteration. We state the following algorithm:

1. select one $v_0 \in \mathbb{R}^N$, $t = 0$;

2. calculate $v_{t+1} = F_\alpha(v_t)$, $f_t = \arg\max F_\alpha(v_t)$;

3. if $\|v_{t+1} - v_t\|_\infty < \epsilon$, stop; otherwise set $t \leftarrow t + 1$ and go to step 2.

Since $F_\alpha$ is a contracting map, by the Banach contraction mapping principle we conclude that
for all $v_0 \in \mathbb{R}^N$, $v_t \to v^*$ and $f_t \to f^*$ as $t \to \infty$, where $v^*$ is the fixed point of $F_\alpha$ such that
$F_\alpha(v^*) = v^*$ and $f^*$ denotes the corresponding policy. The final step is to prove $v^* = J_\alpha^*$ with the
following theorem.

Theorem 4.1. Suppose Assumption 3.1 holds true. For any $v \in \mathbb{R}^N$, i) if $v \ge F_\alpha(v)$, then $v \ge J_\alpha^*$;
ii) if $v \le F_\alpha(v)$, then $v \le J_\alpha^*$; iii) if $v = F_\alpha(v)$, then $v = J_\alpha^*$.

Proof. (i) Consider a Markov policy $\pi = [\pi_0, \pi_1, \ldots]$. $v \ge F_\alpha(v)$ implies that for any $\pi \in \Delta$,

$v \ge F_\alpha(v) \ge r^\pi + \alpha R^\pi(v)$

We apply the above inequality recursively,

$v \ge r^{\pi_0} + \alpha R^{\pi_0}(v) \ge r^{\pi_0} + \alpha R^{\pi_0}(r^{\pi_1} + \alpha R^{\pi_1}(v)) \ge \ldots \ge J_\alpha(\pi)$

Since $\pi$ is arbitrary, the above inequality implies $v \ge \sup_\pi J_\alpha(\pi) = J_\alpha^*$.

(ii) Under Assumption 3.1 there exists an $f \in \Delta_D$ such that $F_\alpha(v|x) = r(x,f(x)) + \alpha R(v|x,f(x))$.
Write $r^f(x) := r(x,f(x))$ and $R^f(v|x) := R(v|x,f(x))$. Since $v \le F_\alpha(v)$, we have

$v \le F^f_\alpha(v) = r^f + \alpha R^f(v) \le r^f + \alpha R^f(r^f + \alpha R^f(v)) \le \ldots \le J_\alpha(f^\infty) \le J_\alpha^*$

where we apply the monotonicity of $R^f$ recursively. By Prop. 4.1 ii), the limit $J_\alpha(\pi)$ of these recursions
exists for any $\pi \in \Pi_M$. (i) + (ii) implies (iii). □
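The value iteration for the discounted total prospect can be sketched as below. The entropic map of Eq. (15) is used as one example of a non-coherent prospect map; the function names, the default tolerance, the iteration cap and the two-state model are assumptions made only for illustration.

```python
import numpy as np

def entropic_map(v, Q, lam):
    """R(v|x,a) = (1/lam) log sum_y Q(y|x,a) exp(lam v(y)), for all (x, a) at once."""
    return np.log(Q @ np.exp(lam * v)) / lam

def discounted_value_iteration(r, R_xa, alpha, eps=1e-10, max_iter=10_000):
    """Iterate v_{t+1} = F_alpha(v_t) until the sup-norm of the update drops below eps."""
    v = np.zeros(r.shape[0])                           # step 1: v_0 = 0
    for _ in range(max_iter):
        q_values = r + alpha * R_xa(v)                 # r(x,a) + alpha R(v|x,a)
        v_next = q_values.max(axis=1)                  # step 2: F_alpha(v_t)
        f = q_values.argmax(axis=1)                    #         greedy deterministic policy
        if np.max(np.abs(v_next - v)) < eps:           # step 3: stopping rule
            return v_next, f
        v = v_next
    raise RuntimeError("no convergence within max_iter iterations")

# Hypothetical 2-state, 2-action model, layout Q[x, a, y].
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
alpha, lam = 0.9, -1.0

v_star, f_star = discounted_value_iteration(r, lambda v: entropic_map(v, Q, lam), alpha)
```

Because $F_\alpha$ contracts with modulus $\alpha$, the number of iterations grows roughly like $\log(1/\epsilon) / \log(1/\alpha)$. For large $|\lambda v|$ the exponential may overflow; a log-sum-exp formulation would be a safer choice in practice.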
4.3 Average Prospect

Analogous to the average reward $S$ defined in Eq. (2), we consider the following average prospect,

$J(x,\pi) := \liminf_{T\to\infty} \frac{1}{T} J_T(x,\pi), \quad \pi \in \Pi_M, x \in X,$

where $J_T(x,\pi)$ is defined in Eq. (5). Here "liminf" is used to cover the case where the limit of
$\frac{1}{T} J_T$ does not exist (see e.g., Example 8.1.1, [26]). The optimization problem of the average prospect
is therefore

$J^*(x) = \sup_{\pi \in \Pi_M} J(x,\pi)$

Suppose there is a pair $(h, \rho)$, $h \in \mathbb{R}^N$, $\rho \in \mathbb{R}$, which satisfies the following equation

$\rho + h(x) = \max_{\pi \in \Delta} [r^\pi(x) + R^\pi(h|x)]$ (8)

This equation is called the average prospect optimality equation (APOE). Under Assumption 3.1 there
exists a deterministic function $f \in \Delta_D$ such that

$\rho + h(x) = \max_{a \in A} [r(x,a) + R(h|x,a)] = r(x,f(x)) + R(h|x,f(x))$

Define the operator $F^\pi$ as

$F^\pi(v) := r^\pi + R^\pi(v), \quad v \in \mathbb{R}^N$

Let $\pi = [\pi_0, \pi_1, \ldots] \in \Pi_M$ be an arbitrary random Markov policy. Define

$F^\pi_T(v) := F^{\pi_0}(F^{\pi_1}(\ldots F^{\pi_{T-1}}(v) \ldots))$ (9)

Lemma 4.2. Suppose Assumption 3.1 holds true and the APOE has a solution $(h, \rho)$, $h \in \mathbb{R}^N$
and $\rho \in \mathbb{R}$. Let $f$ be the deterministic policy found in the APOE. Then $\rho = J(x, f^\infty) = J^*(x)$ for
all $x \in X$.

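Proof. We prove $\rho = J(x, f^\infty)$ first. Define an operator $F^f: \mathbb{R}^N \to \mathbb{R}^N$ as follows,

$F^f(v) := r^f + R^f(v)$

and $F^f_t(v) := F^f(F^f_{t-1}(v))$, $t = 1, 2, \ldots$, with $F^f_0(v) := v$. Hence, due to the nonexpansiveness of $R^f$, we have

$\|J_T(f^\infty) - F^f_T(h)\|_\infty \le \|r^f - h\|_\infty \;\Rightarrow\; \lim_{T\to\infty} \left( \frac{1}{T} J_T(f^\infty) - \frac{1}{T} F^f_T(h) \right) = 0$ (10)

On the other hand, by the APOE, we have

$F^f_T(h) = F^f_{T-1}(r^f + R^f(h)) = F^f_{T-1}(h) + \rho = \ldots = h + T\rho$

Hence, $\lim_{T\to\infty} \frac{1}{T} F^f_T(h) = \rho$. Together with Eq. (10), we obtain $\lim_{T\to\infty} \frac{1}{T} J_T(f^\infty) = \rho$.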

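Now we prove that $J(x,\pi) \le \rho$ for any $\pi \in \Pi_M$ and all $x \in X$. By the APOE, we have for all
$\pi \in \Delta$,

$\rho + h \ge r^\pi + R^\pi(h)$ (11)

Let $\pi \in \Pi_M$ be any Markov random policy. Then $F^\pi_T$ defined in Eq. (9) satisfies

$\|J_T(\pi) - F^\pi_T(v)\|_\infty \le \|r^{\pi_T} - v\|_\infty \;\Rightarrow\; \lim_{T\to\infty} \left( \frac{1}{T} J_T(\pi) - \frac{1}{T} F^\pi_T(v) \right) = 0$ (12)

By Eq. (11) we have

$F^\pi_T(h) \le F^\pi_{T-1}(\rho + h) = F^\pi_{T-1}(h) + \rho \le \ldots \le h + T\rho$

which implies

$\liminf_{T\to\infty} \frac{1}{T} F^\pi_T(h) \le \rho \;\Rightarrow\; \liminf_{T\to\infty} \frac{1}{T} J_T(\pi) \le \rho$

□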



The question now is to find proper assumptions that guarantee the existence of solutions
of the APOE. Assumption 3.1 alone is not sufficient for this purpose. Recall that $\|\cdot\|_H$ denotes the Hilbert
semi-norm defined in Sec. 3.2. We further assume

Assumption 4.1. There exist an integer $K$ and a real number $\beta \in [0,1)$ such that for all
deterministic policies $\pi = [f_0, f_1, \ldots, f_{K-1}] \in \Delta_D^K$,

$\|R^\pi(u) - R^\pi(v)\|_H \le \beta \|u - v\|_H, \quad \forall u, v \in \mathbb{R}^N$

where $R^\pi(\cdot) := R^{f_0}(R^{f_1}(\ldots R^{f_{K-1}}(\cdot) \ldots))$.

Define the operator $F: \mathbb{R}^N \to \mathbb{R}^N$,

$F(v|x) := \max_{a \in A} \{ r(x,a) + R(v|x,a) \}, \quad F^t(v) := F(F^{t-1}(v)), \; t = 1, 2, \ldots$ (13)

Proposition 4.2. If Assumptions 3.1 and 4.1 hold true, then $\|F^K(u) - F^K(v)\|_H \le \beta \|u - v\|_H$
for all $u, v \in \mathbb{R}^N$.

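Proof. Let $F^\pi_K$ be as defined in Eq. (9). There must be two policies $\pi_u = [f_0, f_1, \ldots, f_{K-1}]$, $\pi_v =
[g_0, g_1, \ldots, g_{K-1}] \in \Delta_D^K$ satisfying $F^{\pi_u}_K(u) = F^K(u)$ and $F^{\pi_v}_K(v) = F^K(v)$ respectively.

$F^K(u) - F^K(v) \le F^{\pi_u}_K(u) - F^{\pi_u}_K(v)$

$= R^{f_0}(r^{f_1} + R^{f_1}(\ldots + R^{f_{K-1}}(u) \ldots)) - R^{f_0}(r^{f_1} + R^{f_1}(\ldots + R^{f_{K-1}}(v) \ldots))$

(Prop. 2.1) $= R^{f_0}(R^{f_1}(\ldots R^{f_{K-1}}(\sum_{t=1}^{K-1} r^{f_t} + u) \ldots)) - R^{f_0}(R^{f_1}(\ldots R^{f_{K-1}}(\sum_{t=1}^{K-1} r^{f_t} + v) \ldots))$

Exchanging $u$ and $v$, we have $F^K(v) - F^K(u) \le F^{\pi_v}_K(v) - F^{\pi_v}_K(u)$. Thus,

$\|F^K(u) - F^K(v)\|_H \le \max_{\pi \in \Delta_D^K} \|F^\pi_K(u) - F^\pi_K(v)\|_H$

$= \max_{\pi \in \Delta_D^K} \left\| R^\pi\left(\sum_{t=1}^{K-1} r^{f_t} + u\right) - R^\pi\left(\sum_{t=1}^{K-1} r^{f_t} + v\right) \right\|_H \le \beta \|u - v\|_H$

□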
Theorem 4.2. Suppose Assumptions 3.1 and 4.1 hold true. Then there exist $\rho \in \mathbb{R}$ and $h \in \mathbb{R}^N$
satisfying the APOE (Eq. (8)), and furthermore $\lim_{t\to\infty} F^t(v_0) = h$ for all $v_0 \in \mathbb{R}^N$, where $F^t$ is
defined in Eq. (13).

Proof. Let $K$ be the integer in Assumption 4.1. Then by Prop. 4.2 we have

$\|F^K(u) - F^K(v)\|_H \le \beta \|u - v\|_H, \quad \forall u, v \in \mathbb{R}^N$

Starting from an arbitrary point $v_0 \in \mathbb{R}^N$, the iteratively computed sequence $v_{t+1} = F(v_t)$, $t =
0, 1, 2, \ldots$, satisfies

$\|v_{tK+1} - v_{tK}\|_H \le \beta \|v_{(t-1)K+1} - v_{(t-1)K}\|_H \le \ldots \le \beta^t \|v_1 - v_0\|_H$

Thus, $\|v_{tK+1} - v_{tK}\|_H \to 0$ as $t \to \infty$. Since $F$ is nonexpansive in the Hilbert semi-norm, we have
$\|v_{t+1} - v_t\|_H \le \|v_{\lfloor t/K \rfloor K+1} - v_{\lfloor t/K \rfloor K}\|_H$, where $\lfloor t/K \rfloor := \max\{i \in \mathbb{N}: iK \le t\}$. Thus, $\forall t \in \mathbb{N}$,

$\|v_{t+1} - v_0\|_H \le \sum_{i=0}^{t} \|v_{i+1} - v_i\|_H \le \sum_{i=0}^{t} \beta^{\lfloor i/K \rfloor} \|v_1 - v_0\|_H \le \frac{K}{1-\beta} \|v_1 - v_0\|_H$

which implies that for all $t$, $v_t$ is bounded and therefore in the space $\mathbb{R}^N$, i.e.,

$h := \lim_{t\to\infty} v_t \in \mathbb{R}^N$ exists.

Since $h$ is bounded, $F(h)$ is bounded, as well as $F(h) - h$. Hence, due to the fact that
$\|F(h) - h\|_H = 0$, there exists a finite $\rho \in \mathbb{R}$ satisfying

$\rho + h(x) = F(h|x) = \max_{a \in A} \{ r(x,a) + R(h|x,a) \}$

which is the APOE.

□

Remarks. In classical MDPs with finite state-action space, $R(v|x,a) = \sum_{y \in X} Q(y|x,a) v(y)$. It
can be shown (Theorem 8.5.3, [26]) that if for each $a \in A$, $Q(y|x,a)$ is recurrent (irreducible
and aperiodic), then there exist some integer $K$ and one state $y \in X$ such that for all $K$-stage
deterministic policies $\pi \in \Delta_D^K$, $P^\pi(y|x) > 0$ for all $x \in X$, and furthermore Assumption 4.1 holds
true (Theorem 8.5.2 and 6.6.2, [26]). Therefore, for classical MDPs, Assumption 4.1 is equivalent
to the conventional assumption of recurrence.

However, for general prospect maps, how to easily verify whether Assumption 4.1 is satisfied
is still an open question. Some insights were given in [12], where the properties of general topical
maps are investigated via the associated graph, $G(R)$, of the topical map $R$. The authors found some
sufficient conditions (strong connectedness of the associated graph) that guarantee the existence
of a fixed point in Hilbert space. However, we find that these conditions fail for entropic maps
(for the definition see Eq. (15)) when $\lambda < 0$. We leave this question for future work.

Value iteration Based on Theorem 4.2 we state the following algorithm:

1. select one $v_0 \in \mathbb{R}^{|X|}$, $t \leftarrow 0$;

2. calculate $v_{t+1} = F(v_t)$; $f_t = \arg\max F(v_t)$;

3. if $\|v_{t+1} - v_t\|_H < \epsilon$, stop; otherwise $t \leftarrow t + 1$ and go to step 2.

Theorem 4.2 guarantees that $v_t \to h$, $v_{t+1} - v_t \to \rho$ and $f_t \to f^*$ as $t \to \infty$, where $h$ and $\rho$ are the
solutions to the APOE and $f^*$ denotes the optimal policy of the average prospect problem. More
specifically, $\rho$ is equivalent to $J^*$.
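
The three steps of the value iteration can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the Hilbert semi-norm is taken to be the span $\max_x v(x) - \min_x v(x)$, rewards are maximized, and the iterate is shifted by its first component after every step (a relative value iteration variant, which leaves the differences unchanged whenever the conditional map satisfies $R(v + c) = R(v) + c$). The names `average_value_iteration`, `REWARDS` and `TRANS`, as well as the two-state MDP, are hypothetical.

```python
import math


def span(v):
    """Span of a vector, max(v) - min(v), used here as the Hilbert semi-norm."""
    return max(v) - min(v)


def expectation_map(v, probs):
    """Risk-neutral conditional map: sum_y Q(y|x,a) v(y)."""
    return sum(p * vy for p, vy in zip(probs, v))


def entropic_map(v, probs, lam):
    """Entropic conditional map: (1/lam) log sum_y Q(y|x,a) exp(lam v(y))."""
    return math.log(sum(p * math.exp(lam * vy) for p, vy in zip(probs, v))) / lam


def average_value_iteration(rewards, trans, cond_map, eps=1e-9, max_iter=10000):
    """rewards[x][a] is the reward, trans[x][a] the list of Q(y|x,a) over y.

    Returns (rho, h, policy), with h shifted so that h[0] = 0.
    """
    v = [0.0] * len(rewards)
    for _ in range(max_iter):
        fv, policy = [], []
        for r_x, q_x in zip(rewards, trans):
            vals = [r + cond_map(v, q) for r, q in zip(r_x, q_x)]
            best = max(range(len(vals)), key=vals.__getitem__)
            policy.append(best)
            fv.append(vals[best])
        diff = [f - old for f, old in zip(fv, v)]  # v_{t+1} - v_t
        v = [f - fv[0] for f in fv]                # shift keeps the iterates bounded
        if span(diff) < eps:                       # stopping rule of step 3
            return 0.5 * (max(diff) + min(diff)), v, policy
    raise RuntimeError("value iteration did not converge")


# Hypothetical two-state MDP: in state 0 the agent may stay (reward 1) or move
# to state 1 (reward 0); state 1 pays reward 2 and returns to state 0 half the time.
REWARDS = [[1.0, 0.0], [2.0]]
TRANS = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]]
rho, h, policy = average_value_iteration(REWARDS, TRANS, expectation_map)
```

Passing `lambda v, q: entropic_map(v, q, lam)` as `cond_map` runs the same loop with an entropic map; convergence in that case still depends on the assumptions of the theorem.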
5 Examples



There exist several important risk-sensitive maps in the literature of economics, finance and control
theory. Most of them can be adapted to the framework introduced above. We assume that the
prospect maps under consideration are time-homogeneous and Markov, and that Assumption 3.1
holds. The form of $R(v|x,a)$ then determines the prospect map that we select, so in the
following we simply state the conditional prospect map for each specific prospect
map. There are only a few exceptions in the literature where the maps used are not within our
framework, e.g. value at risk [17] and the mean-variance trade-off for entire MDPs [18]. In
those exceptions, dynamic programming cannot be applied due to the lack of time-consistency, and
exact solutions to the optimization problems are therefore usually computationally infeasible for
high-dimensional state-action spaces.

Classical MDPs [1 [26] 

$$R(v|x,a) := \mathbb{E}_{Q_{x,a}}[v(X')] = \sum_{y \in X} Q(y|x,a)\, v(y) \tag{14}$$
It is easy to verify that $R$ is coherent and linear in $v$ (therefore risk-neutral).



Entropic maps The name is taken from the literature of CRM [11]. These maps have also been studied
extensively in operations research (e.g., Borkar (2002) [5] and references therein) and in control theory
(e.g., Coraluppi & Marcus (2000) [8] and references therein), due to their good properties.



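$$R(v|x,a) := \frac{1}{\lambda} \log \mathbb{E}_{Q_{x,a}}[\exp(\lambda v(X'))] = \frac{1}{\lambda} \log \Big\{ \sum_{y \in X} Q(y|x,a)\, e^{\lambda v(y)} \Big\} \tag{15}$$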



where the risk-sensitive parameter $\lambda \in \mathbb{R}$ controls the risk-preference of the risk map $R$: if $\lambda > 0$,
$R$ is everywhere convex and therefore everywhere risk-seeking; if $\lambda < 0$, $R$ is everywhere concave
and therefore everywhere risk-aversive. It can also be shown that

$$\lim_{\lambda \to 0} \frac{1}{\lambda} \log \Big\{ \sum_{y \in X} Q(y|x,a) \exp(\lambda v(y)) \Big\} = \sum_{y \in X} Q(y|x,a)\, v(y)$$

which is exactly the conditional map of classical MDPs. Besides, it is connected to the
mean-variance trade-off scheme via the following Taylor expansion at $\lambda = 0$,

$$\frac{1}{\lambda} \log \mathbb{E} \exp(\lambda Z) = \mathbb{E} Z + \frac{\lambda}{2} \mathrm{Var}(Z) + O(\lambda^2)$$

where $Z$ denotes an arbitrary random variable. Suppose that risk is measured by variance. The
objective is to maximize $R^{\pi}$ (see Sec. 4.1). Therefore, if $\lambda < 0$, variance is penalized and the agent
is intuitively risk-aversive. Conversely, if $\lambda > 0$, variance is rewarded and the agent is intuitively
risk-seeking. These intuitions coincide with the categorization based on the convexity (concavity) of $R$.
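
The limit and the mean-variance connection can be checked numerically. The sketch below evaluates the entropic map on a small hypothetical distribution (`VALUES`, `PROBS` are made up for illustration) and compares it with the standard second-order cumulant approximation $\mathbb{E}Z + \tfrac{\lambda}{2}\mathrm{Var}(Z)$.

```python
import math


def entropic(values, probs, lam):
    """Entropic map (1/lam) log E[exp(lam Z)] for a finite distribution."""
    return math.log(sum(p * math.exp(lam * z) for p, z in zip(probs, values))) / lam


def mean_and_variance(values, probs):
    """Mean and variance of a finite distribution."""
    mean = sum(p * z for p, z in zip(probs, values))
    var = sum(p * (z - mean) ** 2 for p, z in zip(probs, values))
    return mean, var


# Hypothetical three-point distribution.
VALUES = [0.0, 1.0, 4.0]
PROBS = [0.5, 0.3, 0.2]
MEAN, VAR = mean_and_variance(VALUES, PROBS)

for lam in (0.5, 0.1, 0.01, -0.01, -0.1, -0.5):
    exact = entropic(VALUES, PROBS, lam)
    approx = MEAN + 0.5 * lam * VAR
    print(f"lambda={lam:+.2f}  entropic={exact:.5f}  mean+lambda/2*var={approx:.5f}")
```

For small $|\lambda|$ the two columns agree closely, and the entropic value lies below the mean for negative $\lambda$ and above it for positive $\lambda$, in line with the risk-aversive and risk-seeking readings above.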

Remark. Some of the literature (e.g., Borkar (2002) [5]) does not satisfy Assumption
3.1. Instead of $R^{\pi}(v|x) = \sum_{a} \pi(a|x) R(v|x,a)$, these works define

$$R^{\pi}(v|x) := \frac{1}{\lambda} \log \Big\{ \sum_{y} P^{\pi}(y|x) \exp(\lambda v(y)) \Big\}$$

i.e., $\pi$ is inside the log function rather than outside it as in our definition. However,
the optimal policy they find is still deterministic and is equivalent to the optimal policy according
to our definition. In this sense, there is no essential difference between their definition and ours.

Robust maps Iyengar (2005) [19] introduced the framework of robust dynamic programming. He
argues that in some applications the transition model $Q$ cannot be inferred exactly. Instead, he
employs a set of transition probabilities, $\mathcal{P}$, which contains all possible "ambiguous" transition
probabilities. To gain "robustness", the worst case (i.e., the lowest reward) is considered.
Adapted to our framework, this reads

$$R(v|x,a) := \inf_{Q \in \mathcal{P}} \mathbb{E}_{Q_{x,a}}[v(X')] = \inf_{Q \in \mathcal{P}} \sum_{y \in X} Q(y|x,a)\, v(y)$$

It is apparent that $R$ is coherent. We can verify that $R$ is everywhere concave and therefore
risk-aversive, which coincides with the intuition that the worst scenario is considered. One special case
of robust dynamic programming is minimax control (for details see Coraluppi (1997) [31]),
which also considers the worst scenario and can be used only in finite state spaces:

$$R(v|x,a) := \min_{y:\, Q(y|x,a) > 0} v(y)$$
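
Both maps are easy to evaluate when everything is finite. The sketch below assumes that the ambiguity set is a finite list of candidate transition vectors, so that the infimum becomes a minimum; the names `robust_map`, `minimax_map` and the numbers are hypothetical.

```python
def expectation(v, probs):
    """Expected next-state value under one candidate transition vector."""
    return sum(p * vy for p, vy in zip(probs, v))


def robust_map(v, candidates):
    """Robust map: worst expected value over a finite set of transition vectors."""
    return min(expectation(v, q) for q in candidates)


def minimax_map(v, probs):
    """Minimax map: worst value among the successors that are reachable at all."""
    return min(vy for vy, p in zip(v, probs) if p > 0)


# Hypothetical next-state values and two candidate models for one pair (x, a).
V = [1.0, 2.0, 10.0]
CANDIDATES = [[0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]

worst_expected = robust_map(V, CANDIDATES)    # the first candidate is the pessimistic one
worst_outcome = minimax_map(V, CANDIDATES[1])  # all three successors are reachable
```

Concavity can be observed directly: for two value vectors, the robust map of their average is never smaller than the average of their robust maps.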

Conditional average value at risk has important applications in finance (see e.g. [27]).
Adapted to our framework, it can be defined as

R{v\x, a) = sup \u + -E^^ MX') - u)+]\ (16) 

where $(z)^+$ denotes $z \vee 0$. $R$ is coherent and everywhere concave. Therefore, this prospect map is
risk-aversive.

Mean-semideviation maps [25] This map considers only the trade-off between the one-step
conditional mean and semideviation (see Eq. (17) below) rather than the deviation of the whole Markov
chain (see [30, 10]).

Riv\x,a) ■.= EQjviX')]+\EQ^ [(«(X') - EQjt;(X')])1 (17) 

where $r > 1$ and $\lambda \in \mathbb{R}$ denotes the risk-preference parameter, which controls the risk-preference
of $R$: if $\lambda < 0$, $R$ is risk-aversive; if $\lambda > 0$, $R$ is risk-seeking. It can be shown that $R$ is concave
for negative $\lambda$ and convex for positive $\lambda$. This map was used by Gosavi (2006) [14] (with
$r = 2$) to approximate the mean-variance trade-off scheme defined in [10].

Probability weighting maps Consider the following map

$$R(r|x,a) := \mathbb{E}^{w}_{Q_{x,a}}[u(r(X'))] := \sum_{y \in X} w(Q(y|x,a))\, u(r(y))$$

where $u(\cdot)$ and $w(\cdot)$ denote the utility function and the probability weighting function respectively. $u$ is
assumed to be monotonically increasing with $u(0) = 0$. $w(\cdot)$ is also monotonically
increasing and satisfies $w(0) = 0$ and $w(1) = 1$. Note that for general utility functions, the
map $R$ constructed above satisfies the axioms of monotonicity and centralization, but not necessarily
the axiom of time-consistency. To amend this, we replace the immediate rewards
$r(x,a)$ in the definition of the total prospect (see Eq. [S] and H]) with their utilities $u(r(x,a))$.

This map has a long history in economics as a model of mixed risk-preference, which is determined
by the choice of the utility function and the probability weighting function (for a nice review see [32]).
However, in economics it is only used to model single decision problems. Here we generalize it to
adapt to the temporal structure of MDPs.
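
To make the mixed risk-preference concrete, the sketch below evaluates the probability weighting map for two lotteries. The particular functions are assumptions made only for illustration and are not taken from this paper: `weight` is an inverse-S weighting function of the Tversky-Kahneman type and `utility` is a power utility that penalizes losses more strongly; the parameter values 0.61, 0.88 and 2.25 are likewise illustrative.

```python
def weight(p, gamma=0.61):
    """Inverse-S probability weighting with w(0) = 0 and w(1) = 1 (hypothetical choice)."""
    return p ** gamma / (p ** gamma + (1.0 - p) ** gamma) ** (1.0 / gamma)


def utility(r, power=0.88, loss_scale=2.25):
    """Increasing utility with u(0) = 0 that weighs losses more heavily (hypothetical choice)."""
    return r ** power if r >= 0 else -loss_scale * (-r) ** power


def weighted_map(rewards, probs):
    """Probability weighting map: sum_y w(Q(y|x,a)) u(r(y))."""
    return sum(weight(p) * utility(r) for p, r in zip(probs, rewards))


# A 5% chance of winning 100 versus a sure gain of 5 ...
gain_bet = weighted_map([100.0, 0.0], [0.05, 0.95])
gain_safe = weighted_map([5.0], [1.0])
# ... and a 5% chance of losing 100 versus a sure loss of 5.
loss_bet = weighted_map([-100.0, 0.0], [0.05, 0.95])
loss_safe = weighted_map([-5.0], [1.0])
```

Under these hypothetical choices the small probability is overweighted, so the map prefers the bet when facing a gain and the sure payment when facing a loss, which is one example of a mixed risk-preference.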

Choquet integral was applied in Chateauneuf & Cohen (2008) [6] to model subjective
expected utility [29]. The Choquet integral is determined by a non-additive measure $\mu$, which satisfies
monotonicity and centralization. Furthermore, the Choquet integral is coherent. Chateauneuf &
Cohen (2008) focus on the one-step decision problem. However, it is trivial to extend the theory
to the MDP structure. In fact, the Choquet integral can be viewed as a coherent prospect map, which
is a special case within our general framework.



6 Reinforcement Learning Approaches 

In real applications, the transition model $Q$ and the reward function $r$ are not known before
exploring the system and collecting samples. Reinforcement learning (RL) approaches like
temporal-difference learning [33] and Q-learning are popular online algorithms that do not
require $Q$ and $r$. These approaches belong to the class of stochastic approximation algorithms [22],
whose key requirement is that the conditional prospect map $R(v)$ is linear in the transition model $Q$. The
conditional prospect map of classical MDPs has this property (cf. Eq. (14)). For general prospect maps,
however, the transition model $Q$ is usually necessary to compute the conditional prospect map, and
stochastic approximation approaches like Q-learning are therefore in general not applicable, except for
some specific maps that can be transformed into an equivalent map which is linear in $Q$. This is, for
example, the case for entropic maps.



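Q-Learning for Entropic Measure We consider only the risk-aversive case, i.e. $\lambda < 0$; the
risk-seeking case can be dealt with similarly. Substituting Eq. (15) (with discount factor $\alpha$) into
the value iteration gives

$$v_{t+1}(x) = \max_a \Big\{ r(x,a) + \frac{1}{\lambda} \log \mathbb{E}_{Q_{x,a}}\big[e^{\lambda \alpha v_t(X')}\big] \Big\} \iff e^{\lambda v_{t+1}(x)} = \min_a \Big\{ e^{\lambda r(x,a)}\, \mathbb{E}_{Q_{x,a}}\big[e^{\lambda \alpha v_t(X')}\big] \Big\}$$

where the maximum turns into a minimum because $\lambda < 0$. Let $w := \exp(\lambda v)$. The above equation is
equivalent to $w_{t+1}(x) = \min_a \{ e^{\lambda r(x,a)}\, \mathbb{E}[(w_t)^{\alpha} | x, a] \}$,
which is linear in the transition model $Q$. Having observed the state $x_t$, the action $a_t$, the reward $r_t$ at time
$t$ and the successor state $x_{t+1}$, the update rule of Q-learning for $w$ (instead of $v$) is

$$q_{t+1}(x_t,a_t) \leftarrow q_t(x_t,a_t) + \beta_t \Big[ e^{\lambda r_t} \min_a \big(q_t(x_{t+1},a)\big)^{\alpha} - q_t(x_t,a_t) \Big] \tag{18}$$

where $\beta_t$ denotes the learning rate.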
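
A single update of this kind can be written as follows. This is a minimal sketch of the update rule for $\lambda < 0$ only; the table layout `q[x][a]`, the function names and the one-state test problem are assumptions made for illustration. Values are recovered from the table via $v(x) = \frac{1}{\lambda} \log \min_a q(x,a)$.

```python
import math


def entropic_q_update(q, x, a, r, x_next, lam, alpha, beta):
    """One update in the transformed variable w = exp(lam * v), for lam < 0.

    q[x][a] approximates exp(lam * Q-value); alpha is the discount factor
    and beta the learning rate.
    """
    target = math.exp(lam * r) * min(q[x_next]) ** alpha
    q[x][a] += beta * (target - q[x][a])
    return q[x][a]


def state_value(q, x, lam):
    """Recover v(x) from the transformed table; the greedy action minimizes q[x]."""
    return math.log(min(q[x])) / lam


# Hypothetical one-state, one-action problem with constant reward 1.
LAM, ALPHA, BETA = -0.1, 0.5, 0.5
q = [[1.0]]  # w = 1 corresponds to v = 0
for _ in range(200):
    entropic_q_update(q, 0, 0, 1.0, 0, LAM, ALPHA, BETA)
v0 = state_value(q, 0, LAM)
```

Since the toy problem is deterministic, the entropic value coincides with the ordinary discounted value $1 / (1 - \alpha) = 2$, which gives a simple sanity check for the update.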



Model-based Approaches for General Prospect Maps We introduce an algorithm similar
to the Dyna-Q approach [33]. Repeat the following procedure until convergence:

1. Given data $(x_t, a_t, x_{t+1}, r_t)$, update the model estimates $\hat{Q}^{(t)}$ and $\hat{r}^{(t)}$.

2. Update the Q-value at $(x_t, a_t)$ based on the estimated models $\hat{Q}^{(t)}$ and $\hat{r}^{(t)}$ by $q(x_t,a_t) =
\hat{r}^{(t)}(x_t,a_t) + \rho(\gamma \max_a q(x_{t+1}, a) | x_t, a_t, \hat{Q}^{(t)})$.

3. Perform $k$ additional updates: choose $k$ state-action pairs at random and update them according
to the same rule: $q(x_k,a_k) = \hat{r}^{(t)}(x_k,a_k) + \rho(\gamma \max_a q(x_{k+1}, a) | x_k, a_k, \hat{Q}^{(t)})$.

4. Choose an action $a_{t+1}$ at state $x_{t+1}$ based on the softmax or $\epsilon$-greedy policy. Go to step 1.
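
The four steps can be sketched as follows. This is one possible reading under stated assumptions, not the authors' implementation: the model is estimated by visit counts and average rewards, the conditional prospect map is passed in as a function `prospect(values, probs)` applied to the discounted greedy values of all successor states under the estimated transition probabilities, and actions are chosen $\epsilon$-greedily. The names `dyna_prospect` and `env_step` and the toy environment are hypothetical.

```python
import math
import random
from collections import defaultdict


def entropic_map(values, probs, lam):
    """Entropic conditional map for a finite successor distribution."""
    return math.log(sum(p * math.exp(lam * v) for p, v in zip(probs, values))) / lam


def dyna_prospect(env_step, n_states, n_actions, prospect,
                  gamma=0.9, k=5, steps=500, eps=0.1, seed=0):
    """env_step(x, a, rng) must return (next_state, reward)."""
    rng = random.Random(seed)
    counts = defaultdict(lambda: [0] * n_states)  # successor counts per (x, a)
    r_sum = defaultdict(float)                    # accumulated reward per (x, a)
    q = [[0.0] * n_actions for _ in range(n_states)]

    def backup(x, a):
        n = sum(counts[(x, a)])
        probs = [c / n for c in counts[(x, a)]]   # estimated transition model
        r_hat = r_sum[(x, a)] / n                 # estimated reward
        targets = [gamma * max(q[y]) for y in range(n_states)]
        q[x][a] = r_hat + prospect(targets, probs)

    x = 0
    for _ in range(steps):
        if rng.random() < eps:                    # step 4: epsilon-greedy choice
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=q[x].__getitem__)
        y, r = env_step(x, a, rng)
        counts[(x, a)][y] += 1                    # step 1: update the model
        r_sum[(x, a)] += r
        backup(x, a)                              # step 2: update the visited pair
        seen = list(counts)
        for _ in range(k):                        # step 3: k extra model-based updates
            backup(*rng.choice(seen))
        x = y
    return q


def toy_env(x, a, rng):
    """Hypothetical one-state problem: action 0 pays 1, action 1 pays 0."""
    return 0, (1.0 if a == 0 else 0.0)


q_neutral = dyna_prospect(toy_env, 1, 2,
                          lambda v, p: sum(pi * vi for pi, vi in zip(p, v)), gamma=0.5)
q_entropic = dyna_prospect(toy_env, 1, 2,
                           lambda v, p: entropic_map(v, p, -0.5), gamma=0.5)
```

Because the map only enters through `prospect`, the same loop can be reused with any conditional prospect map that can be evaluated from an estimated transition vector.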



7 Experiments with Simple MDPs 

By presenting two experiments, we aim to illustrate by the first experiment the capability of 
modeling mixed risk-preference via designing new prospect map under our framework, and by the 
second experiment to verify the effectiveness of the Q-learning algorithm introduced in Sec. [6] 

Experiment 1: Sequential Betting Game The prospect simply expresses the preferences of the
agent, and there is no right or wrong choice. The framework introduced here offers a set of options
to the agent, and in this first experiment we want to show that this set contains useful options.
Furthermore, we compare the solutions obtained by optimizing different maps to see whether the
resulting optimal policies differ significantly.

The task is a game with two stages, and at both stages the agent faces two options: he
can either bet or do nothing (denoted by bet and no in Fig. 1(a)). At the 1st stage (State 1 in
Fig. 1(a)), if "bet" is chosen then the agent is rewarded with $100 with a 5% chance and $0
with a 95% chance; if "no" is chosen then he gains $5. At the 2nd stage (State 4 in Fig. 1(a)), if "bet" is
chosen then the agent suffers a loss of $100 with a 5% chance and $0 with a 95% chance; if "no" is chosen,
then he loses $5. From stage 2 he goes back to stage 1 to repeat the game. The discount factor
is set to 0.99.



Figure 1: Sequential Betting Game. (a) MDP. (b) Value of State 1 vs $\lambda$.



The betting game captures the essence of risky choices: a risk-neutral agent is
indifferent between the options "bet" and "not bet", as the mean reward of the two options is the
same at both stages. What should be done is thus a question of risk-preference. A risk-seeking
person or company will prefer the "bet" option. A risk-aversive user will prefer "not bet". A
mixed risk-preference might also be preferable. For example, people might be happy to bet when they
cannot lose money (stage 1) but might not be happy to bet if they have to pay when they lose (stage 2).

The total reward expresses the risk-neutral option, and a user who is indifferent to risk should
choose this criterion. The entropic map can encode both risk-aversion and risk-seeking, depending
on its parameter $\lambda$. We calculated the optimal policy using entropic maps for different values of $\lambda$
with value iteration (cf. Sec. 4.2) and plotted the results in Fig. 1(b) (left). $\lambda$ is shown on
the x-axis and the entropic value of state 1 (Eq. (15)) on the y-axis. As expected, the optimal
policy is risk-seeking if $\lambda$ is positive, i.e. the policy chooses "bet, bet", and risk-aversive if
$\lambda$ is negative, i.e. the policy chooses "no, no" at stages 1 and 2.

The curve can be explained intuitively with the help of the mean-variance criterion. For small
$\lambda$ the entropic map approximates the mean-variance criterion, and for the "no, no" policy the
variance is 0; hence we expect a value close to the mean, which is 0.92.

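The second map, the mixed entropic map, is constructed from entropic maps:

$$R(v|x,a) := \frac{1}{\gamma} \log \mathbb{E}_{Q_{x,a}}\big[e^{\gamma v(X')}\big], \qquad \gamma = \begin{cases} \lambda & \text{if } \mathbb{E}_{Q_{x,a}}\big[e^{\lambda v(X')}\big] > 1 \\ -\lambda & \text{otherwise} \end{cases}$$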


for some $\lambda > 0$. It is easy to check that this map satisfies the axioms of prospect maps and is
convex (risk-seeking) if $v > 0$ and concave (risk-aversive) if $v < 0$. However, on the whole space the
measure is neither convex nor concave. The risk-preference is controlled by $\lambda > 0$. The result is
shown in Fig. 1(b) (right). For small $\lambda$, the optimal policy chosen by this measure is as expected:
risk-seeking "bet" when facing gain but risk-aversive "not bet" when facing loss.
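
The qualitative behavior can be reproduced with a one-step calculation. The sketch below is a simplification, not the experiment itself: it compares only the immediate lotteries of the two stages and ignores continuation values and discounting. The mixed map is implemented under the assumption that the sign of $\lambda$ is switched depending on whether $\mathbb{E}[e^{\lambda v}]$ exceeds 1; the names `STAGE1`, `STAGE2` and `preferred` are hypothetical.

```python
import math


def entropic(outcomes, probs, lam):
    """Entropic map (1/lam) log E[exp(lam * outcome)]."""
    return math.log(sum(p * math.exp(lam * z) for p, z in zip(probs, outcomes))) / lam


def mixed_entropic(outcomes, probs, lam):
    """Mixed entropic map for lam > 0: risk-seeking for gains, risk-aversive for losses."""
    moment = sum(p * math.exp(lam * z) for p, z in zip(probs, outcomes))
    return entropic(outcomes, probs, lam if moment > 1.0 else -lam)


PROBS = [0.05, 0.95]
STAGE1 = {"bet": [100.0, 0.0], "no": [5.0, 5.0]}    # win $100 with 5%, or take $5
STAGE2 = {"bet": [-100.0, 0.0], "no": [-5.0, -5.0]}  # lose $100 with 5%, or pay $5


def preferred(stage, cond_map):
    """Option with the larger one-step prospect."""
    return max(stage, key=lambda name: cond_map(stage[name], PROBS))


seeking = [preferred(s, lambda z, p: entropic(z, p, 0.01)) for s in (STAGE1, STAGE2)]
aversive = [preferred(s, lambda z, p: entropic(z, p, -0.01)) for s in (STAGE1, STAGE2)]
mixed = [preferred(s, lambda z, p: mixed_entropic(z, p, 0.01)) for s in (STAGE1, STAGE2)]
```

Even in this simplified setting, a positive $\lambda$ bets at both stages, a negative $\lambda$ bets at neither, and the mixed map bets only when facing a gain.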

Grid World: Q-Learning In this experiment, we consider the 11 x 11 grid world depicted in
Fig. 2(a). The agent obtains a small reward of 3 if it hits the upper-right corner (marked
"S") and a large reward of 15 if it arrives at the lower-left corner (marked "L"). The shaded
cells denote "dangerous" states, where the agent is punished by a negative reward of -5.
In real applications, "dangerous" states can model uncertain areas where punishments might
be incurred.

The agent has four actions at each state: "left", "right", "up" and "down". Choosing "left",
the agent deterministically moves to the left neighboring state. If the agent is on the left boundary
of the grid world, choosing "left" keeps it at the current state. The transitions of the other 3
actions are defined similarly. However, if the agent is in a "dangerous" state and chooses "left",
the probability of arriving at the left neighboring state (the probability of escape) is only $p_e < 1$,
and the agent stays at the "dangerous" state with probability $1 - p_e$. For all 4 actions, the
probability of escape is set to the same value.
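
The transition rule described above can be sketched as a small function. The coordinate convention (row 0 at the top), the set of dangerous cells and the names `step` and `MOVES` are assumptions made for illustration; the actual layout of the dangerous cells is given only in the figure.

```python
import random

SIZE = 11
MOVES = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}


def step(state, action, dangerous, p_escape, rng):
    """Return the next (row, col) cell.

    Moves are deterministic and clipped at the boundary, except that leaving a
    dangerous cell succeeds only with probability p_escape.
    """
    row, col = state
    d_row, d_col = MOVES[action]
    target = (min(max(row + d_row, 0), SIZE - 1),
              min(max(col + d_col, 0), SIZE - 1))
    if state in dangerous and rng.random() >= p_escape:
        return state  # the escape failed, the agent stays where it is
    return target


rng = random.Random(0)
DANGEROUS = {(5, 5)}  # hypothetical dangerous cell
at_wall = step((0, 0), "left", DANGEROUS, 0.5, rng)
free_move = step((3, 3), "down", DANGEROUS, 0.5, rng)
```

Rewards are left out on purpose, since the text specifies their values but not the exact step at which they are paid.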

We start from the upper-left corner (marked by a black point). The classical MDP with a high
discount factor chooses the path that hits the large-reward state "L" (depicted with red arrows in
Fig. 2(a)). However, since the large-reward state is surrounded by dangerous states, a risk-aversive
agent will dislike the policy that generates the highest average reward and instead choose the safer
path (black arrows in Fig. 2(a)) that avoids all the dangerous states.

We apply Q-learning to solve the optimization problems for both the classical MDP and the entropic map
(cf. Sec. 6). The same setup is used for both maps. In total, 200 episodes of 250
steps each are run. At the beginning of each episode, the state $s_t$ is reset to the start state (upper-left
corner). The learning rate of each state-action pair decays with the number of visits to
the pair. The action at each update step is chosen by an $\epsilon$-greedy policy, where $\epsilon$ decays with
the episode number; in early episodes, $\epsilon$ is therefore high to encourage exploration. At the end of
each episode, a greedy policy is computed from the current Q-values and its performance is evaluated
by the value of the start state, $v_1$, generated by the learned policy. Q-learning is considered to
have converged if this value is very close to the optimal value of the start state, $v_1^*$, which is computed
by value iteration (we know the transition model and the reward model) before running the Q-learning
algorithm.

Fig. 2(b) plots the absolute difference $|v_1 - v_1^*|$ during the learning procedure. The black curve
depicts the result averaged over 200 random trials of the entropic map with $\lambda = 0.01$ and $\alpha = 0.9$,
while the red dotted curve shows the averaged result of the classical MDP with $\alpha = 0.9$. Both maps have
the same optimal policy, which finally hits the large-reward state, and have similar optimal values
$v_1^* \approx 36$. Therefore, Q-learning should behave and perform similarly for both maps,
which is confirmed by Fig. 2(b): the two error curves decay at almost the same speed and therefore
converge at the same speed.



References 

[1] B. Acciaio, H. Follmer, and I. Penner. Risk assessment for uncertain cash flows: model
ambiguity, discounting ambiguity, and the role of bubbles. arXiv preprint arXiv:1002.3627,
2010.

[2] P. Artzner, F. Delbaen, J.M. Eber, and D. Heath. Coherent measures of risk. Mathematical
Finance, 9(3):203-228, 1999.

[3] D.P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA,
1995.

[4] D.P. Bertsekas and S.E. Shreve. Stochastic Optimal Control: The Discrete Time Case, volume
139. Academic Press, 1978.



Figure 2: Grid World. (a) 11 x 11 grid world. (b) Error between learned value and optimal value.



[5] V.S. Borkar and S.P. Meyn. Risk-sensitive optimal control for Markov decision processes with
monotone cost. Mathematics of Operations Research, pages 192-209, 2002.

[6] A. Chateauneuf and M. Cohen. Cardinal extensions of the EU model based on the Choquet
integral. Decision-making Process, pages 401-433, 2008.

[7] K.J. Chung and M.J. Sobel. Discounted MDPs: distribution functions and exponential utility
maximization. SIAM Journal on Control and Optimization, 25:49, 1987.

[8] S.P. Coraluppi and S.I. Marcus. Mixed risk-neutral/minimax control of discrete-time,
finite-state Markov decision processes. IEEE Transactions on Automatic Control, 45(3):528-532,
2000.

[9] K. Detlefsen and G. Scandolo. Conditional and dynamic convex risk measures. Finance and
Stochastics, 9(4):539-561, 2005.

[10] J.A. Filar, L.C.M. Kallenberg, and H.M. Lee. Variance-penalized Markov decision processes.
Mathematics of Operations Research, pages 147-161, 1989.

[11] H. Follmer and A. Schied. Convex measures of risk and trading constraints. Finance and
Stochastics, 6(4):429-447, 2002.

[12] S. Gaubert and J. Gunawardena. The Perron-Frobenius theorem for homogeneous, monotone
functions. Transactions of the American Mathematical Society, 356(12):4931-4950, 2004.

[13] C. Gollier. The Economics of Risk and Time. The MIT Press, 2004.

[14] A. Gosavi. A risk-sensitive approach to total productive maintenance. Automatica,
42(8):1321-1330, 2006.

[15] O. Hernandez-Lerma. Adaptive Markov Control Processes, volume 79. Springer, 1989.

[16] O. Hernandez-Lerma and J.B. Lasserre. Discrete-time Markov Control Processes: Basic
Optimality Criteria. Springer, 1996.

[17] G.A. Holton. Value-at-risk: theory and practice. Academic Press, 2003.






[18] Y. Huang and L.C.M. Kallenberg. On finding optimal policies for Markov decision chains: a
unifying framework for mean-variance-tradeoffs. Mathematics of Operations Research, pages
434-448, 1994.

[19] G.N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, pages
257-280, 2005.

[20] A. Jobert and L.C.G. Rogers. Valuations and dynamic convex risk measures. Mathematical
Finance, 18(1):1-22, 2008.

[21] D. Kahneman and A. Tversky. Prospect theory: an analysis of decision under risk.
Econometrica: Journal of the Econometric Society, pages 263-291, 1979.

[22] H.J. Kushner and G. Yin. Stochastic approximation and recursive algorithms and applications,
volume 35. Springer Verlag, 2003.

[23] S. Mannor and J.N. Tsitsiklis. Mean-variance optimization in Markov decision processes.
Submitted. http://www.mit.edu/~jnt/Papers/P-10-mv-MDP-sub.pdf, 2010.

[24] R.D. Nussbaum. Hilbert's Projective Metric and Iterated Nonlinear Maps. American
Mathematical Society, Providence, RI, 1988.

[25] W. Ogryczak and A. Ruszczynski. From stochastic dominance to mean-risk models:
Semideviations as risk measures. European Journal of Operational Research, 116(1):33-50, 1999.

[26] M.L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming.
John Wiley & Sons, Inc., 1994.

[27] R.T. Rockafellar and S. Uryasev. Conditional value-at-risk for general loss distributions.
Journal of Banking & Finance, 26(7):1443-1471, 2002.

[28] A. Ruszczynski. Risk-averse dynamic programming for Markov decision processes.
Mathematical Programming, pages 1-27, 2010.

[29] L.J. Savage. The foundations of statistics. Dover Publications, 1972.

[30] M.J. Sobel. The variance of discounted Markov decision processes. Journal of Applied
Probability, pages 794-802, 1982.

[31] S.P. Coraluppi. Optimal Control of Markov Decision Processes for Performance and
Robustness. PhD thesis, University of Maryland, 1997.

[32] C. Starmer. Developments in non-expected utility theory: The hunt for a descriptive theory
of choice under risk. Journal of Economic Literature, 38(2):332-382, 2000.

[33] R.S. Sutton and A.G. Barto. Reinforcement learning, volume 9. MIT Press, 1998.






